An SVM classifier creates a line (a plane or hyperplane, depending on the dimensionality of the data) in N-dimensional space that separates data points belonging to two classes. The original SVM was designed for exactly this objective: solving binary classification problems. Unlike, say, linear regression, which finds the line of best fit that minimizes the sum of squared errors (when using OLS regression), or logistic regression, which uses maximum likelihood estimation to find the best-fitting sigmoid curve, a support vector machine uses the concept of margins to make its predictions.
The SVM algorithm predicts one of two classes: one class is labeled +1 and the other -1.
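Concretely, a linear SVM learns a weight vector w and a bias b, scores a point x with w · x + b, and predicts the class from the sign of that score:

$$\hat{y} = \operatorname{sign}(w \cdot x + b)$$

The margin is the distance between the separating hyperplane and the closest points of each class (the support vectors), and the SVM chooses w and b so that this margin is as wide as possible.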
Like all machine learning algorithms, SVM converts the business problem into a mathematical equation involving unknowns. These unknowns are found by turning the problem into an optimization problem. Since optimization problems always aim to maximize or minimize some quantity while tweaking the unknowns, the SVM classifier uses a loss function known as the hinge loss, which is minimized to find the maximum margin.
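In its standard form, the hinge loss for a prediction f(x) = w · x + b and a true label y in {-1, +1} is

$$c(x, y, f(x)) = \max\bigl(0,\ 1 - y \cdot f(x)\bigr)$$

so the cost is 0 whenever the point lies on the correct side of the margin, and it grows linearly the further the point strays to the wrong side.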
For ease of understanding, this loss function can also be thought of as a cost function whose cost is 0 when no point is incorrectly classified; otherwise, an error/loss is incurred. The catch is that there is a trade-off between maximizing the margin and the loss generated when the margin is pushed too wide. To balance these two goals, a regularization parameter is added to the objective.
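Combining the two terms, the objective minimized by a soft-margin linear SVM over n training points, with the regularization parameter λ controlling the trade-off, can be written as

$$\min_{w,\, b}\ \lambda \lVert w \rVert^{2} + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i (w \cdot x_i + b)\bigr)$$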
As with most optimization problems, the weights are optimized by calculating gradients, i.e. the partial derivatives of the objective with respect to the weights.
When a point is classified correctly (and lies outside the margin), the weight update uses only the regularization term of the gradient; when a point is misclassified or falls inside the margin, the hinge-loss term contributes to the gradient as well.
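Differentiating the objective with respect to w gives two cases, depending on whether the hinge term is active for a given training point:

$$\frac{\partial}{\partial w} = \begin{cases} 2\lambda w & \text{if } y_i (w \cdot x_i + b) \ge 1 \\ 2\lambda w - y_i x_i & \text{otherwise} \end{cases}$$

As a minimal from-scratch sketch of this update rule (the analysis below uses scikit-learn instead; the function name, learning rate, and epoch count here are illustrative assumptions):

import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.001, epochs=1000):
    # X: (n_samples, n_features) array, y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) >= 1:
                # correct side of the margin: only the regularization term contributes
                w -= lr * (2 * lam * w)
            else:
                # misclassified or inside the margin: hinge-loss term contributes too
                w -= lr * (2 * lam * w - y_i * x_i)
                b -= lr * (-y_i)
    return w, b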
We have a dataset with personal data from a social media company. Its features include age, salary, and a factor variable stating whether the customer purchased the item that was advertised to them. I am using the scikit-learn package to perform this analysis.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv('mediaads.csv')
dataset
| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 15755018 | Male | 36 | 33000 | 0 |
| 399 | 15594041 | Female | 49 | 36000 | 1 |
400 rows × 5 columns
Our dataset has 4 features and one target variable. Age and EstimatedSalary are the only numerical features we expect to be associated with a purchase, so I save these to X as predictors. I save the purchase decision to y, since it is the target.
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
I will hold out 25% of the data for testing and train on the remaining 75%. sklearn does most of this for me with the train_test_split function, and StandardScaler() standardizes the features: it is fit on the training set and then applied to the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Using the SVC class from sklearn, we can choose the Radial Basis Function (RBF) kernel as one of our parameters.
The fit method is a fundamental part of the Scikit-Learn library. It’s used to train a machine learning model on a dataset. Specifically, the fit method takes in a dataset (typically represented as a 2D array or matrix) and a set of labels, and then fits the model to the data.
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
SVC(random_state=0)
Using the predict() method, I can predict the labels of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array. We can analyze our model's performance with the confusion_matrix function.
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test,y_pred)
[[64  4]
 [ 3 29]]
0.93
An accuracy of 93% is pretty good. The confusion matrix shows 64 true negatives with 4 false positives, and 29 true positives with 3 false negatives. With this level of accuracy, the classifier could be very helpful for deciding which new customers to show the product to.
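With the fitted scaler and classifier in hand, scoring a new customer takes one line. The age and salary below are made-up values for illustration, not rows from the dataset:

new_customer = [[30, 87000]]  # hypothetical customer: age 30, estimated salary 87000
print(classifier.predict(sc.transform(new_customer)))  # 1 = likely to purchase, 0 = not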
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# build a fine grid over the (scaled) feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# colour each grid point by its predicted class to show the decision regions
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# overlay the actual test points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()